Understanding customer behavior and characteristics is a critical part of marketing operations for any business or organization, with direct consequences for sales and marketing strategy. Customer segmentation is often viewed as a means to achieve a better return on investment from marketing efforts: by grouping customers according to their unique needs and motivations, organizations can use their money, time, and other critical resources more efficiently in tailored marketing strategies for each group.
In the context of marketing analytics, customer segmentation plays a vital role in optimizing ROI. It typically involves analyzing metrics around customer engagement with various marketing activities, including ATL (above-the-line) marketing, BTL (below-the-line) campaigns, and personalized offers. The variables of interest are usually customer profiles, campaign conversion rates, and information associated with various marketing channels. Based on these feature categories, the goal is to create the best possible customer segments from the given data.
Research has shown that customer segmentation can have a large impact on email engagement: segmented campaigns often see over 100% more clicks than non-segmented campaigns, and email marketers who segment their audience before campaigning have reported a 6-7x growth in overall revenue. It has also been widely observed that today's customers prefer personalized communications and offerings that cater to their particular interests.
Using Unsupervised Learning ideas such as Dimensionality Reduction and Clustering, the objective is to come up with the best possible customer segments using the given customer dataset.
We are trying to group similar customers together, i.e., create customer segments, using well-known unsupervised learning algorithms. We will solve the given problem using best practices and techniques of dimensionality reduction and clustering, seeking answers to the following key questions:
The dataset contains the following features:
Note: You can assume that the data is collected in the year 2016.
This notebook can be considered a guide to refer to while solving the problem. The evaluation will be as per the rubric shared for each Milestone. Unlike previous courses, it does not follow the pattern of graded questions in different sections. This notebook gives you a direction on the steps needed to reach a viable solution. Please note that this is just one way of doing it; there can be other 'creative' ways to solve the problem, and we encourage you to explore them as an 'optional' exercise.
In the notebook, there are markdown cells called - Observations and Insights. It is a good practice to provide observations and extract insights from the outputs.
The naming convention for different variables can vary. Please consider the code provided in this notebook as a sample code.
All the outputs in the notebook are just for reference and can be different if you follow a different approach.
There are sections called Think About It in the notebook that will help you get a better understanding of the reasoning behind a particular technique/step. Interested learners can take alternative approaches if they wish to explore different techniques.
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# To scale the data using z-score
from sklearn.preprocessing import StandardScaler
# To compute distances
from scipy.spatial.distance import cdist
# To perform K-means clustering and compute Silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# To visualize the elbow curve and Silhouette scores
from yellowbrick.cluster import SilhouetteVisualizer
# Importing PCA
from sklearn.decomposition import PCA
# To encode the variable
from sklearn.preprocessing import LabelEncoder
# Importing TSNE
from sklearn.manifold import TSNE
# To perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# To compute distances
from scipy.spatial.distance import pdist
# To import K-Medoids
from sklearn_extra.cluster import KMedoids
# To import Gaussian Mixture
from sklearn.mixture import GaussianMixture
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
# loading the dataset
data = pd.read_csv("marketing_campaign.csv")
# Print the shape of the data
data.shape
(2240, 27)
# View first 5 rows
data.head()
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 27 columns
# View last 5 rows Hint: Use tail() method
data.tail()
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2235 | 10870 | 1967 | Graduation | Married | 61223.0 | 0 | 1 | 13-06-2013 | 46 | 709 | ... | 3 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2236 | 4001 | 1946 | PhD | Together | 64014.0 | 2 | 1 | 10-06-2014 | 56 | 406 | ... | 2 | 5 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2237 | 7270 | 1981 | Graduation | Divorced | 56981.0 | 0 | 0 | 25-01-2014 | 91 | 908 | ... | 3 | 13 | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2238 | 8235 | 1956 | Master | Together | 69245.0 | 0 | 1 | 24-01-2014 | 8 | 428 | ... | 5 | 10 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2239 | 9405 | 1954 | PhD | Married | 52869.0 | 1 | 1 | 15-10-2012 | 40 | 84 | ... | 1 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 27 columns
# Check the datatypes of each column. Hint: Use info() method
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   ID                   2240 non-null   int64
 1   Year_Birth           2240 non-null   int64
 2   Education            2240 non-null   object
 3   Marital_Status       2240 non-null   object
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64
 6   Teenhome             2240 non-null   int64
 7   Dt_Customer          2240 non-null   object
 8   Recency              2240 non-null   int64
 9   MntWines             2240 non-null   int64
 10  MntFruits            2240 non-null   int64
 11  MntMeatProducts      2240 non-null   int64
 12  MntFishProducts      2240 non-null   int64
 13  MntSweetProducts     2240 non-null   int64
 14  MntGoldProds         2240 non-null   int64
 15  NumDealsPurchases    2240 non-null   int64
 16  NumWebPurchases      2240 non-null   int64
 17  NumCatalogPurchases  2240 non-null   int64
 18  NumStorePurchases    2240 non-null   int64
 19  NumWebVisitsMonth    2240 non-null   int64
 20  AcceptedCmp3         2240 non-null   int64
 21  AcceptedCmp4         2240 non-null   int64
 22  AcceptedCmp5         2240 non-null   int64
 23  AcceptedCmp1         2240 non-null   int64
 24  AcceptedCmp2         2240 non-null   int64
 25  Complain             2240 non-null   int64
 26  Response             2240 non-null   int64
dtypes: float64(1), int64(23), object(3)
memory usage: 472.6+ KB
# Find the percentage of missing values in each column of the data
for col in data.columns:
print(col,"-",(data.isnull().sum()/data.shape[0] * 100)[col]," %")
ID - 0.0 %
Year_Birth - 0.0 %
Education - 0.0 %
Marital_Status - 0.0 %
Income - 1.0714285714285714 %
Kidhome - 0.0 %
Teenhome - 0.0 %
Dt_Customer - 0.0 %
Recency - 0.0 %
MntWines - 0.0 %
MntFruits - 0.0 %
MntMeatProducts - 0.0 %
MntFishProducts - 0.0 %
MntSweetProducts - 0.0 %
MntGoldProds - 0.0 %
NumDealsPurchases - 0.0 %
NumWebPurchases - 0.0 %
NumCatalogPurchases - 0.0 %
NumStorePurchases - 0.0 %
NumWebVisitsMonth - 0.0 %
AcceptedCmp3 - 0.0 %
AcceptedCmp4 - 0.0 %
AcceptedCmp5 - 0.0 %
AcceptedCmp1 - 0.0 %
AcceptedCmp2 - 0.0 %
Complain - 0.0 %
Response - 0.0 %
We can observe that ID has no null values. Also, the number of unique values is equal to the number of observations, so ID looks like an index for the data entries. Such a column would not provide any predictive power for our analysis, so it can be dropped.
Dropping the ID column
# Remove ID column from data. Hint: Use inplace = True
data.drop(columns = ["ID"], inplace= True)
# Explore basic summary statistics of numeric variables. Hint: Use describe() method.
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Year_Birth | 2240.0 | 1968.805804 | 11.984069 | 1893.0 | 1959.00 | 1970.0 | 1977.00 | 1996.0 |
| Income | 2216.0 | 52247.251354 | 25173.076661 | 1730.0 | 35303.00 | 51381.5 | 68522.00 | 666666.0 |
| Kidhome | 2240.0 | 0.444196 | 0.538398 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| Teenhome | 2240.0 | 0.506250 | 0.544538 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| Recency | 2240.0 | 49.109375 | 28.962453 | 0.0 | 24.00 | 49.0 | 74.00 | 99.0 |
| MntWines | 2240.0 | 303.935714 | 336.597393 | 0.0 | 23.75 | 173.5 | 504.25 | 1493.0 |
| MntFruits | 2240.0 | 26.302232 | 39.773434 | 0.0 | 1.00 | 8.0 | 33.00 | 199.0 |
| MntMeatProducts | 2240.0 | 166.950000 | 225.715373 | 0.0 | 16.00 | 67.0 | 232.00 | 1725.0 |
| MntFishProducts | 2240.0 | 37.525446 | 54.628979 | 0.0 | 3.00 | 12.0 | 50.00 | 259.0 |
| MntSweetProducts | 2240.0 | 27.062946 | 41.280498 | 0.0 | 1.00 | 8.0 | 33.00 | 263.0 |
| MntGoldProds | 2240.0 | 44.021875 | 52.167439 | 0.0 | 9.00 | 24.0 | 56.00 | 362.0 |
| NumDealsPurchases | 2240.0 | 2.325000 | 1.932238 | 0.0 | 1.00 | 2.0 | 3.00 | 15.0 |
| NumWebPurchases | 2240.0 | 4.084821 | 2.778714 | 0.0 | 2.00 | 4.0 | 6.00 | 27.0 |
| NumCatalogPurchases | 2240.0 | 2.662054 | 2.923101 | 0.0 | 0.00 | 2.0 | 4.00 | 28.0 |
| NumStorePurchases | 2240.0 | 5.790179 | 3.250958 | 0.0 | 3.00 | 5.0 | 8.00 | 13.0 |
| NumWebVisitsMonth | 2240.0 | 5.316518 | 2.426645 | 0.0 | 3.00 | 6.0 | 7.00 | 20.0 |
| AcceptedCmp3 | 2240.0 | 0.072768 | 0.259813 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp4 | 2240.0 | 0.074554 | 0.262728 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp5 | 2240.0 | 0.072768 | 0.259813 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp1 | 2240.0 | 0.064286 | 0.245316 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp2 | 2240.0 | 0.012946 | 0.113069 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Complain | 2240.0 | 0.009375 | 0.096391 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Response | 2240.0 | 0.149107 | 0.356274 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
# List of the categorical columns in the data
cols = ["Education", "Marital_Status", "Kidhome", "Teenhome", "Complain"]
Distribution (proportion) of observations in each category
for column in cols:
print("Unique values in", column, "are :")
print(data[column].value_counts(normalize=True))
print("*" * 50)
Unique values in Education are :
Graduation    0.503125
PhD           0.216964
Master        0.165179
2n Cycle      0.090625
Basic         0.024107
Name: Education, dtype: float64
**************************************************
Unique values in Marital_Status are :
Married     0.385714
Together    0.258929
Single      0.214286
Divorced    0.103571
Widow       0.034375
Alone       0.001339
Absurd      0.000893
YOLO        0.000893
Name: Marital_Status, dtype: float64
**************************************************
Unique values in Kidhome are :
0    0.577232
1    0.401339
2    0.021429
Name: Kidhome, dtype: float64
**************************************************
Unique values in Teenhome are :
0    0.516964
1    0.459821
2    0.023214
Name: Teenhome, dtype: float64
**************************************************
Unique values in Complain are :
0    0.990625
1    0.009375
Name: Complain, dtype: float64
**************************************************
Think About It:
# Replace the category "2n Cycle" with the category "Master"
data["Education"].replace("2n Cycle", "Master", inplace = True) # Hint: Use the replace() method and inplace=True
# Replace the categories "Alone", "Absurd", "YOLO" with the category "Single"
data["Marital_Status"].replace(["Alone","Absurd", "YOLO"], "Single", inplace = True) # Hint: Use the replace() method and inplace=True
Univariate analysis is used to explore each variable in a data set, separately. It looks at the range of values, as well as the central tendency of the values. It can be done for both numerical and categorical variables.
Histograms help to visualize and describe numerical data. We can also use other plots like box plot to analyze the numerical columns.
# Create histogram for the Income feature
plt.figure(figsize=(15, 7))
sns.histplot(x="Income", data=data)
plt.show()
We can observe some extreme values on the right side of the distribution of the 'Income' feature. Let's use a box plot, as it is better suited to identifying extreme values in the data.
# Plot the boxplot
sns.boxplot(data=data, x="Income", showmeans=True, color="violet")
plt.show()
Think About It
# Calculating the upper whisker for the Income variable
Q1 = data["Income"].quantile(q=0.25) # Finding the first quartile of Income
Q3 = data["Income"].quantile(q=0.75) # Finding the third quartile of Income
IQR = Q3 - Q1 # Finding the Inter Quartile Range
upper_whisker = Q3 + 1.5 * IQR # Calculating the Upper Whisker for the Income variable
print(upper_whisker) # Printing Upper Whisker
118350.5
# Let's check the observations with extreme value for the Income variable
data[data.Income > upper_whisker]
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 164 | 1973 | PhD | Married | 157243.0 | 0 | 1 | 01-03-2014 | 98 | 20 | 2 | ... | 22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 617 | 1976 | PhD | Together | 162397.0 | 1 | 1 | 03-06-2013 | 31 | 85 | 1 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 655 | 1975 | Graduation | Divorced | 153924.0 | 0 | 0 | 07-02-2014 | 81 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 687 | 1982 | PhD | Married | 160803.0 | 0 | 0 | 04-08-2012 | 21 | 55 | 16 | ... | 28 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1300 | 1971 | Master | Together | 157733.0 | 1 | 0 | 04-06-2013 | 37 | 39 | 1 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1653 | 1977 | Graduation | Together | 157146.0 | 0 | 0 | 29-04-2013 | 13 | 1 | 0 | ... | 28 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2132 | 1949 | PhD | Married | 156924.0 | 0 | 0 | 29-08-2013 | 85 | 2 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2233 | 1977 | Graduation | Together | 666666.0 | 1 | 0 | 02-06-2013 | 23 | 9 | 14 | ... | 1 | 3 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 rows × 26 columns
data[data.Income > upper_whisker].describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Year_Birth | 8.0 | 1972.500 | 10.028531 | 1949.0 | 1972.50 | 1975.5 | 1977.00 | 1982.0 |
| Income | 8.0 | 221604.500 | 179850.404431 | 153924.0 | 157090.50 | 157488.0 | 161201.50 | 666666.0 |
| Kidhome | 8.0 | 0.375 | 0.517549 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
| Teenhome | 8.0 | 0.250 | 0.462910 | 0.0 | 0.00 | 0.0 | 0.25 | 1.0 |
| Recency | 8.0 | 48.625 | 33.687376 | 13.0 | 22.50 | 34.0 | 82.00 | 98.0 |
| MntWines | 8.0 | 26.500 | 30.798887 | 1.0 | 1.75 | 14.5 | 43.00 | 85.0 |
| MntFruits | 8.0 | 4.500 | 6.524678 | 0.0 | 1.00 | 1.0 | 5.00 | 16.0 |
| MntMeatProducts | 8.0 | 621.875 | 846.511402 | 1.0 | 7.25 | 17.0 | 1592.00 | 1725.0 |
| MntFishProducts | 8.0 | 4.250 | 5.650537 | 1.0 | 1.00 | 2.0 | 3.50 | 17.0 |
| MntSweetProducts | 8.0 | 1.250 | 0.886405 | 0.0 | 1.00 | 1.0 | 1.25 | 3.0 |
| MntGoldProds | 8.0 | 3.750 | 4.131759 | 1.0 | 1.00 | 1.5 | 5.00 | 12.0 |
| NumDealsPurchases | 8.0 | 4.250 | 6.777062 | 0.0 | 0.00 | 0.0 | 6.75 | 15.0 |
| NumWebPurchases | 8.0 | 0.500 | 1.069045 | 0.0 | 0.00 | 0.0 | 0.25 | 3.0 |
| NumCatalogPurchases | 8.0 | 9.875 | 13.484780 | 0.0 | 0.00 | 0.5 | 23.50 | 28.0 |
| NumStorePurchases | 8.0 | 0.750 | 1.035098 | 0.0 | 0.00 | 0.5 | 1.00 | 3.0 |
| NumWebVisitsMonth | 8.0 | 1.125 | 2.031010 | 0.0 | 0.00 | 0.5 | 1.00 | 6.0 |
| AcceptedCmp3 | 8.0 | 0.000 | 0.000000 | 0.0 | 0.00 | 0.0 | 0.00 | 0.0 |
| AcceptedCmp4 | 8.0 | 0.000 | 0.000000 | 0.0 | 0.00 | 0.0 | 0.00 | 0.0 |
| AcceptedCmp5 | 8.0 | 0.000 | 0.000000 | 0.0 | 0.00 | 0.0 | 0.00 | 0.0 |
| AcceptedCmp1 | 8.0 | 0.000 | 0.000000 | 0.0 | 0.00 | 0.0 | 0.00 | 0.0 |
| AcceptedCmp2 | 8.0 | 0.000 | 0.000000 | 0.0 | 0.00 | 0.0 | 0.00 | 0.0 |
| Complain | 8.0 | 0.000 | 0.000000 | 0.0 | 0.00 | 0.0 | 0.00 | 0.0 |
| Response | 8.0 | 0.000 | 0.000000 | 0.0 | 0.00 | 0.0 | 0.00 | 0.0 |
data[data.Income > upper_whisker][["NumDealsPurchases", "NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases"]]
| NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | |
|---|---|---|---|---|
| 164 | 15 | 0 | 22 | 0 |
| 617 | 0 | 0 | 0 | 1 |
| 655 | 0 | 0 | 0 | 0 |
| 687 | 15 | 0 | 28 | 1 |
| 1300 | 0 | 1 | 0 | 1 |
| 1653 | 0 | 0 | 28 | 0 |
| 2132 | 0 | 0 | 0 | 0 |
| 2233 | 4 | 3 | 1 | 3 |
Think About It:
# Check the 99.5th percentile value for the Income variable
print(data["Income"].quantile(q=0.995))
102145.75000000003
# Dropping observations identified as outliers
data.drop(index=[617,655,1300,2132,2233], inplace=True) # Pass the indices of the observations (separated by a comma) to drop them
Now, let's check the distribution of the Income variable after dropping outliers.
# Define a helper function to plot histograms, then plot 'Income'
def draw_histogram(col):
plt.figure(figsize=(15, 7))
sns.histplot(x=col, data=data)
plt.show()
draw_histogram("Income")
# Plot the histogram for 'MntWines'
draw_histogram("MntWines")
# Plot the histogram for 'MntFruits'
draw_histogram("MntFruits")
# Plot the histogram for 'MntMeatProducts'
draw_histogram("MntMeatProducts")
# Plot the histogram for 'MntFishProducts'
draw_histogram("MntFishProducts")
# Plot the histogram for 'MntSweetProducts'
draw_histogram("MntSweetProducts")
# Plot the histogram for 'MntGoldProds'
draw_histogram("MntGoldProds")
numeric_cols = ["NumDealsPurchases", "NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases",
                "NumWebVisitsMonth", "Year_Birth", "Kidhome", "Teenhome", "Recency", "AcceptedCmp1",
                "AcceptedCmp2", "AcceptedCmp3", "AcceptedCmp4", "AcceptedCmp5", "Complain", "Response"]
for col in numeric_cols:
draw_histogram(col)
Let us write a function that will help us create bar plots that indicate the percentage for each category. This function takes the categorical column as the input and returns the bar plot for the variable.
def perc_on_bar(z):
    '''
    Plot a bar chart for a categorical column, annotating each bar
    with the percentage of observations in that category.
    z: name of the categorical column
    Note: the function won't work if a column is passed in the hue parameter
    '''
    total = len(data[z]) # Total number of observations in the column
    plt.figure(figsize=(15, 5))
    ax = sns.countplot(x=z, data=data, palette='Paired', order=data[z].value_counts().index)
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total) # Percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05 # X-position of the annotation
        y = p.get_y() + p.get_height() # Y-position of the annotation (top of the bar)
        ax.annotate(percentage, (x, y), size = 12) # Annotate the percentage
    plt.show() # Show the plot
We have analyzed different categorical and numerical variables. Now, let's check how different variables are related to each other.
A heat map can show a 2D correlation matrix between numerical features.
numeric_cols = ['Year_Birth', 'Income', 'Kidhome', 'Teenhome', 'Recency', 'MntWines', 'MntFruits',
'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
'AcceptedCmp2', 'Response']
plt.figure(figsize=(15, 7)) # Setting the plot size
sns.heatmap(data[numeric_cols].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral") # Plotting the correlation plot
plt.show()
The above correlation heatmap only shows the relationship between numerical variables. Let's check the relationship of numerical variables with categorical variables.
sns.barplot(x="Education", y="Income", data=data)
plt.show()
# Plot the bar plot for Marital_Status and Income
sns.barplot(x="Marital_Status", y="Income", data=data)
plt.show()
# Plot the bar plot for Kidhome and Income
sns.barplot(x="Kidhome", y="Income", data=data)
plt.show()
We can also visualize the relationship between two categorical variables.
# Plot the bar plot for Marital_Status and Kidhome
pd.crosstab(data["Marital_Status"], data["Kidhome"]).plot(kind='bar', stacked=False)
plt.show()
In this section, we will first prepare our dataset for analysis.
Think About It:
# Year_Birth already stores the birth year, so subtracting it from 2016 gives the customer's age at the time of data collection (2016)
data["Age"] = 2016 - data["Year_Birth"]
# Sorting the values in ascending order
data["Age"].sort_values()
1170 20
46 20
1850 21
696 21
2213 21
..
39 73
358 73
894 73
424 75
1950 76
Name: Age, Length: 2232, dtype: int64
Think About It:
Now, let's check the distribution of age in the data.
# Plot histogram to check the distribution of age
draw_histogram("Age")
# Add Kidhome and Teenhome variables to create the new feature called "Kids"
data["Kids"] = data["Kidhome"] + data["Teenhome"]
# Check the unique categories in Marial_Status
data["Marital_Status"].unique()
array(['Single', 'Together', 'Married', 'Divorced', 'Widow'], dtype=object)
# Replace "Married" and "Together" with "Relationship"
data["Marital_Status"].replace(["Married","Together"], "Relationship", inplace = True)
# Replace "Divorced" and "Widow" with "Single"
data["Marital_Status"].replace(["Divorced", "Widow"], "Single", inplace = True)
# Create a new feature called "Status" by replacing "Single" with 1 and "Relationship" with 2 in Marital_Status
data["Status"] = data["Marital_Status"].replace({"Single": 1, "Relationship": 2})
# Add two variables Status and Kids to get the total number of persons in each family
data["Family_Size"] = data["Kids"] + data["Status"]
# Create a new feature
# Add the amount spent on each of product 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds'
expense_cols = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']
data["Expenses"] = data[expense_cols].sum(axis=1)
# Create a new feature
# Add the number of purchases from each channel 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases'
purchase_cols = ['NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases']
data["NumTotalPurchases"] = data[purchase_cols].sum(axis=1)
# Converting the Dt_Customer variable to a Python datetime object
# The dates are in day-first (dd-mm-yyyy) format, so we specify it explicitly to avoid month-first misparsing
data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"], format="%d-%m-%Y")
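Since the join dates are written day-first (e.g., "04-09-2012" means 4 September 2012), the parsing format matters. A small illustration of how the same string yields different dates under the two interpretations:

```python
import pandas as pd

# The same string parsed day-first vs. month-first gives different dates
s = "04-09-2012"
day_first = pd.to_datetime(s, format="%d-%m-%Y")    # 4 September 2012
month_first = pd.to_datetime(s, format="%m-%d-%Y")  # 9 April 2012
print(day_first.month, month_first.month)  # 9 4
```

Without an explicit format, pandas may silently pick the month-first reading, which shifts the min/max dates reported below.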
Let's check the max and min of the date.
# Check the minimum of the date
# Hint: Use the min() method
data["Dt_Customer"].min()
Timestamp('2012-01-08 00:00:00')
# Check the maximum of the date
# Hint: Use the max() method
data["Dt_Customer"].max()
Timestamp('2014-12-06 00:00:00')
Think About It:
# Assigning a fixed reference date (1 January 2015) to the day variable
data["day"] = "01-01-2015"
# Converting the variable day to Python datetime object
data["day"] = pd.to_datetime(data.day)
data["Engaged_in_days"] = (data["day"] - data["Dt_Customer"]).dt.days
# Add all the campaign related variables to get the total number of accepted campaigns by a customer
# "AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3", "AcceptedCmp4", "AcceptedCmp5", "Response"
data["TotalAcceptedCmp"] = data["AcceptedCmp1"] + data["AcceptedCmp2"] + data["AcceptedCmp3"] + data["AcceptedCmp4"] + data["AcceptedCmp5"] + data["Response"]
# Divide the "Expenses" by "NumTotalPurchases" to create the new feature AmountPerPurchase
data['AmountPerPurchase'] = data["Expenses"] / data["NumTotalPurchases"]
Now, let's check the maximum value of the AmountPerPurchase.
# Check the max value
# Hint: Use max() function
data['AmountPerPurchase'].max()
inf
Think About It:
# Find how many observations have NumTotalPurchases equal to 0
data[data['NumTotalPurchases'] == 0]
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | Age | Kids | Status | Family_Size | Expenses | NumTotalPurchases | day | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 981 | 1965 | Graduation | Single | 4861.0 | 0 | 0 | 2014-06-22 | 20 | 2 | 1 | ... | 51 | 0 | 1 | 1 | 6 | 0 | 2015-01-01 | 193 | 0 | inf |
| 1524 | 1973 | Graduation | Single | 3502.0 | 1 | 0 | 2013-04-13 | 56 | 2 | 1 | ... | 43 | 1 | 1 | 2 | 5 | 0 | 2015-01-01 | 628 | 0 | inf |
2 rows × 36 columns
# Drop the observations with NumTotalPurchases equal to 0, using their indices
data.drop(index=[981,1524], inplace=True)
Now, let's check the distribution of values in AmountPerPurchase column.
# Check the summary statistics of the AmountPerPurchase variable
data['AmountPerPurchase'].describe()
count    2230.000000
mean       33.294200
std        45.015305
min         0.533333
25%         9.714286
50%        23.379808
75%        45.313963
max      1679.000000
Name: AmountPerPurchase, dtype: float64
# Plot the histogram for the AmountPerPurchase variable
draw_histogram("AmountPerPurchase")
# Impute the missing values for the Income variable with the median
data['Income'].fillna(data.Income.median(), inplace = True)
Now that we are done with data preprocessing, let's visualize the new features against the Income variable, whose missing values we have just imputed.
# Plot the scatter plot with Expenses on Y-axis and Income on X-axis
plt.figure(figsize=(20, 10)) # Setting the plot size
sns.scatterplot(x="Income", y="Expenses", data=data) # Hint: Use sns.scatterplot()
plt.xticks(fontsize=16) # Font size of X-label
plt.yticks(fontsize=16) # Font size of Y-label
plt.xlabel("Income", fontsize=20, labelpad=20) # Title of X-axis
plt.ylabel("Expenses", fontsize=20, labelpad=20) # Title of Y-axis
plt.show()
# Plot the bar plot for Family Size on X-axis and Income on Y-axis
sns.barplot(x="Family_Size", y="Income", data=data)
plt.show()
We will explore the following unsupervised learning techniques to identify the best fit for the given dataset:
A. Dimensionality Reduction
We will first try to reduce the dimensions of the given dataset using the following techniques:
B. Clustering
After reducing the dimensionality of the dataset, we will use the following different methods to create clusters.
Dimensionality Reduction We will use unsupervised learning techniques to reduce the dimensionality of the data to a lower dimension, so that the insights found during exploration and analysis become easier to understand and interpret. We will focus on two major techniques to reduce the dimensions of the data:
PCA It is an unsupervised, linear dimensionality reduction technique. PCA ensures that most of the variance in the data is preserved while reducing the dimensions, so that the data with reduced dimensions still yields equally trustworthy inferences and insights during any analytical process.
PCA based on the correlation matrix (rather than the covariance matrix) is generally used when features have different units of measurement; this is equivalent to standardizing the data first. To nullify the effect of scale, we will z-score the data so that different units of measurement do not distort the variance that PCA tries to preserve.
t-SNE Stochastic Neighbor Embedding (SNE) is a probabilistic approach to mapping objects from a high-dimensional space to a low-dimensional one. It centers a Gaussian distribution over each data point in the high-dimensional space and constructs a distribution in the low-dimensional space such that the two sets of pairwise similarities match each other as closely as possible. In terms of approach, this method is entirely different from principal component analysis.
t-SNE differs from plain SNE in the distribution used to replicate the data in the lower dimension: instead of a Gaussian, a Student's t-distribution is deployed on top of every data point. It is a non-linear dimensionality reduction technique.
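A minimal t-SNE sketch on synthetic data (the group sizes and `perplexity` value are illustrative; in practice perplexity is tuned and must be smaller than the number of samples):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# 60 points from two well-separated groups in 10 dimensions
X = np.vstack([rng.normal(0, 1, size=(30, 10)),
               rng.normal(8, 1, size=(30, 10))])

# Non-linear embedding of the 10-D data into 2 dimensions
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(embedding.shape)  # (60, 2)
```

The 2-D embedding can then be scatter-plotted to look for visually separable groups before clustering.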
Clustering After reducing the dimensionality of the dataset, we will use the following different methods to create clusters.
K-means clustering In K-means clustering, we partition the data into K clusters. We start by deciding on a fixed number of clusters we expect to find in the data, denoted by K. Visualization techniques like PCA and t-SNE, and methods like elbow plots, can help suggest a suitable value for K.
K-Medoids Clustering Because K-means clustering recomputes cluster centers as means in each iteration, outliers can strongly influence the final clusters. K-medoids clustering instead uses medoids - actual data points chosen as cluster centers - in each iteration, which reduces the impact of outliers.
Gaussian Mixture Models GMM is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. Hence, the distribution of samples within each cluster is modeled by a Gaussian. The parameters of these Gaussians are typically estimated with the Expectation-Maximization (EM) algorithm.
DBSCAN One of the major benefits of DBSCAN is that it handles outliers in the data by itself, labeling points in low-density regions as noise rather than forcing them into a cluster.
Quality of Clustering - Silhouette Plot We will use a Silhouette plot to measure the quality of clusters. The plot displays the Silhouette score for each cluster, which ranges from -1 to 1. A value near 1 means the clusters are well apart from each other and clearly distinguished. A value near 0 means the clusters overlap, i.e., the distance between clusters is not significant. A value near -1 means points have been assigned to the wrong clusters. Generally, a value of 0.5 is used as a cut-off to say that the data points are well clustered.
Note: This is in continuation to the data preprocessing we did in Milestone 1. Results might differ if you have followed different steps in data preprocessing.
!pip install yellowbrick --user
Note: you may need to restart the kernel to use updated packages.
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# To scale the data using z-score
from sklearn.preprocessing import StandardScaler
# To compute distances
from scipy.spatial.distance import cdist
# To perform K-means clustering and compute Silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# To visualize the elbow curve and Silhouette scores
from yellowbrick.cluster import SilhouetteVisualizer
# Importing PCA
from sklearn.decomposition import PCA
# To encode the variable
from sklearn.preprocessing import LabelEncoder
# Importing TSNE
from sklearn.manifold import TSNE
# To perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# To compute distances
from scipy.spatial.distance import pdist
# To import K-Medoids
from sklearn_extra.cluster import KMedoids
# To import Gaussian Mixture
from sklearn.mixture import GaussianMixture
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
data = pd.read_csv("marketing_campaign.csv")
data.drop(columns = ["ID"], inplace= True)
data["Education"].replace("2n Cycle", "Master", inplace = True) # Hint: Use the replace() method and inplace=True
data["Marital_Status"].replace(["Alone","Absurd", "YOLO"], "Single", inplace = True) # Hint: Use the replace() method and inplace=True
Q1 = data.quantile(q=0.25) # Finding the first quartile
Q3 = data.quantile(q=0.75) # Finding the third quartile
IQR = Q3 - Q1 # Finding the Inter Quartile Range
upper_whisker = (Q3 + 1.5 * IQR)["Income"] # Calculating the Upper Whisker for the Income variable
data[data.Income > upper_whisker][["NumDealsPurchases", "NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases"]]
data.drop(index=[617,655,1300,2132,2233], inplace=True) # Pass the indices of the observations (separated by a comma) to drop them
def draw_histogram(col):
    plt.figure(figsize=(15, 7))
    sns.histplot(x=col, data=data)
    plt.show()
draw_histogram("Income")
data["Age"] = 2016 - data["Year_Birth"] # Computing age as of the reference year 2016
# Sorting the values in ascending order
data["Age"].sort_values()
1170 20
46 20
747 21
2213 21
995 21
...
424 75
1950 76
192 116
339 117
239 123
Name: Age, Length: 2235, dtype: int64
data.drop(index = [192,339,239], inplace=True)
data["Kids"] = data["Kidhome"] + data["Teenhome"]
data["Marital_Status"].replace(["Married","Together"], "Relationship", inplace = True)
data["Marital_Status"].replace(["Single","Divorced", "Widow"], "Single", inplace = True)
data["Status"] = data["Marital_Status"].replace({"Single": 1, "Relationship": 2})
data["Family_Size"] = data["Kids"] + data["Status"]
expense_cols = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']
data["Expenses"] = data[expense_cols].sum(axis=1)
purchase_cols = ['NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases']
data["NumTotalPurchases"] = data[purchase_cols].sum(axis=1)
data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"])
# Assigning date to the day variable
data["day"] = "01-01-2015"
# Converting the variable day to Python datetime object
data["day"] = pd.to_datetime(data.day)
data["Engaged_in_days"] = (data["day"] - data["Dt_Customer"]).dt.days
data["TotalAcceptedCmp"] = data["AcceptedCmp1"] + data["AcceptedCmp2"] + data["AcceptedCmp3"] + data["AcceptedCmp4"] + data["AcceptedCmp5"] + data["Response"]
data['AmountPerPurchase'] = data["Expenses"] / data["NumTotalPurchases"]
data.drop(index=[981, 1524], inplace=True) # Dropping the observations with undefined AmountPerPurchase (NumTotalPurchases is 0 for these rows)
data['Income'].fillna(data.Income.median(), inplace = True)
data.shape
(2230, 36)
The choice of variables to use for clustering is critically important and will have a big impact on the clustering solution, so we need to think carefully about which variables to choose. Clearly, this is a step where a lot of contextual knowledge, creativity, and experimentation/iteration is needed.
Moreover, we often use only a few of the data attributes for segmentation (the segmentation attributes) and use some of the remaining ones (the profiling attributes) only to profile the clusters. For example, in market research and market segmentation, we can use behavioral data for segmentation (to segment the customers based on their behavior like amount spent, units bought, etc.), and then use both demographic as well as behavioral data for profiling the segments found.
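As a minimal sketch of this split (the column names below are illustrative placeholders, not columns of this dataset), the segmentation and profiling attributes can be kept as two separate lists so that only the behavioral features feed the clustering algorithm:

```python
import pandas as pd

# Toy customer table; column names are illustrative placeholders
df = pd.DataFrame({
    "amount_spent": [120, 40, 300],
    "units_bought": [10, 4, 25],
    "age": [34, 51, 28],
    "income": [52000, 61000, 47000],
})

# Behavioral attributes drive the segmentation ...
segmentation_attrs = ["amount_spent", "units_bought"]
# ... while demographics are kept only to profile the resulting clusters
profiling_attrs = ["age", "income"]

X_segment = df[segmentation_attrs]                    # fed to the clustering algorithm
X_profile = df[segmentation_attrs + profiling_attrs]  # inspected after clustering
```

Keeping the two lists explicit makes it easy to change which attributes define the segments without touching the profiling step.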
Here, we will use the behavioral attributes for segmentation and drop the demographic attributes like Income, Age, and Family_Size. In addition to this, we need to drop some other columns which are mentioned below.
- Dt_Customer: We have created the Engaged_in_days variable using the Dt_Customer variable, so it will not help with segmentation.
- Complain: About 95% of the customers didn't complain and share the same value for this column, so it will not have a major impact on segmentation.
- day: We have created the Engaged_in_days variable using the day variable, so it will not help with segmentation.
- Status: This column was created only to derive the Family_Size variable, which already contains the Status information.
- Education, Marital_Status, Kids, Kidhome, and Teenhome: distance-based algorithms cannot use a default distance like Euclidean to measure the distance between categorical and numerical variables.
- AcceptedCmp1, AcceptedCmp2, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5, and Response: we have created the TotalAcceptedCmp variable as the aggregate of all of these.
# Dropping all the irrelevant columns and storing the result in data_model
data_model = data.drop(
columns=[
"Year_Birth",
"Dt_Customer",
"day",
"Complain",
"Response",
"AcceptedCmp1",
"AcceptedCmp2",
"AcceptedCmp3",
"AcceptedCmp4",
"AcceptedCmp5",
"Marital_Status",
"Status",
"Kids",
'Education',
'Kidhome',
'Teenhome', 'Income','Age', 'Family_Size'
],
axis=1,
)
# Check the shape of new data
data_model.shape
(2230, 17)
# Check first five rows of new data
data_model.head()
| Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | 8 | 10 | 4 | 7 | 1617 | 25 | 997 | 1 | 64.680000 |
| 1 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | 1 | 1 | 2 | 5 | 27 | 6 | 151 | 0 | 4.500000 |
| 2 | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | 8 | 2 | 10 | 4 | 776 | 21 | 498 | 0 | 36.952381 |
| 3 | 26 | 11 | 4 | 20 | 10 | 3 | 5 | 2 | 2 | 0 | 4 | 6 | 53 | 8 | 91 | 0 | 6.625000 |
| 4 | 94 | 173 | 43 | 118 | 46 | 27 | 15 | 5 | 5 | 3 | 6 | 5 | 422 | 19 | 347 | 0 | 22.210526 |
Let's plot the correlation plot after we've removed the irrelevant variables.
# Plot the correlation plot for new data
plt.figure(figsize=(15, 7)) # Setting the plot size
sns.heatmap(data_model.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral") # Plotting the correlation plot
plt.show()
Observations and Insights:
What is feature scaling?
Feature scaling is a class of statistical techniques that, as the name implies, scales the features of the data so that they all have a similar range. An example makes this clearer:
If you have multiple independent variables like Age, Income, and amount-related variables, with ranges of (18–100 years), (25K–75K), and (100–200) respectively, feature scaling would bring them all into a comparable range.
Why is feature scaling important in Unsupervised Learning?
Feature scaling is especially relevant in machine learning models that compute some sort of distance metric as we do in most clustering algorithms, for example, K-Means.
So, scaling should be done to avoid the problem of one feature dominating over others because the unsupervised learning algorithm uses distance to find the similarity between data points.
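A minimal sketch of this dominance effect, using made-up numbers: before scaling, the Euclidean distance between two customers is driven almost entirely by the large-range Income column, and z-scoring puts both features on comparable terms.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three toy customers: (age, income); the values are illustrative
X = np.array([[20.0, 50_000.0],
              [60.0, 51_000.0],
              [25.0, 74_000.0]])

# Unscaled: the 1,000-unit income gap dwarfs the 40-year age gap
d_unscaled = np.linalg.norm(X[0] - X[1])  # ~1000.8, driven almost entirely by income

# After z-scoring, both features contribute on a comparable scale
X_scaled = StandardScaler().fit_transform(X)
d_scaled = np.linalg.norm(X_scaled[0] - X_scaled[1])  # now dominated by the age difference
```

Without scaling, a clustering algorithm on this data would effectively group customers by income alone.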
Let's scale the data
Standard Scaler: StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance. Unit variance means dividing all the values by the standard deviation.
# Applying standard scaler on new data
scaler = StandardScaler() # Initialize the Standard Scaler
df_scaled = scaler.fit_transform(data_model) # fit_transform the scaler function on new data
df_scaled = pd.DataFrame(df_scaled, columns=data_model.columns) # Converting the scaled array to a dataframe
df_scaled.head()
| Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.306982 | 0.980594 | 1.551219 | 1.676726 | 2.457442 | 1.472507 | 0.843113 | 0.345710 | 1.405761 | 2.506784 | -0.558344 | 0.698098 | 1.677422 | 1.318790 | 1.974283 | 0.620406 | 0.697381 |
| 1 | -0.383595 | -0.872986 | -0.637634 | -0.714923 | -0.651718 | -0.632873 | -0.731545 | -0.172043 | -1.116240 | -0.571292 | -1.175192 | -0.132838 | -0.964662 | -1.164859 | -1.667562 | -0.503481 | -0.639797 |
| 2 | -0.797941 | 0.359764 | 0.570009 | -0.179016 | 1.341802 | -0.148877 | -0.040232 | -0.689796 | 1.405761 | -0.229284 | 1.292199 | -0.548306 | 0.279942 | 0.795916 | -0.173803 | -0.503481 | 0.081283 |
| 3 | -0.797941 | -0.872986 | -0.562157 | -0.652917 | -0.505404 | -0.584473 | -0.750748 | -0.172043 | -0.755954 | -0.913301 | -0.558344 | 0.282630 | -0.921458 | -0.903422 | -1.925849 | -0.503481 | -0.592580 |
| 4 | 1.550020 | -0.391768 | 0.419054 | -0.218877 | 0.153006 | -0.003679 | -0.558717 | 1.381217 | 0.324904 | 0.112725 | 0.058504 | -0.132838 | -0.308295 | 0.534480 | -0.823825 | -0.503481 | -0.246275 |
# Fitting T-SNE with number of components equal to 2 to visualize how the data is distributed
tsne = TSNE(n_components=2, random_state=1, perplexity=35) # Initializing T-SNE with 2 components, random_state=1, and perplexity=35
data_tsne = tsne.fit_transform(df_scaled) # fit_transform T-SNE on the scaled data
data_tsne = pd.DataFrame(data_tsne, columns=[0, 1]) # Converting the embeddings to a dataframe
plt.figure(figsize=(7, 7)) # Setting the plot size
sns.scatterplot(x=0, y=1, data=data_tsne) # Scatter plot of the two T-SNE components
plt.show()
Observation and Insights:
Think about it:
When the variables used in clustering are highly correlated, it causes multicollinearity, which affects the clustering method and results in poor cluster profiling (or biased toward a few variables). PCA can be used to reduce the multicollinearity between the variables.
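A small self-contained illustration of this decorrelation claim (synthetic data, not the campaign dataset): two strongly correlated input features go in, and the principal components that come out are numerically uncorrelated.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
x = rng.normal(size=500)
# y is mostly a copy of x, so the raw features are highly correlated
y = x + 0.1 * rng.normal(size=500)
X = np.column_stack([x, y])

corr_before = np.corrcoef(X, rowvar=False)[0, 1]  # close to 1

# The PCA scores are mutually orthogonal, so their correlation vanishes
Z = PCA(n_components=2, random_state=1).fit_transform(X)
corr_after = np.corrcoef(Z, rowvar=False)[0, 1]   # close to 0
```

This is why distance-based clustering on the PCA scores is not biased toward a block of correlated variables the way clustering on the raw features can be.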
# Defining the number of principal components to generate
n = data_model.shape[1] # Storing the number of variables in the data
pca = PCA(n_components = n, random_state = 1) # Initialize PCA with n_components = n and random_state=1
data_pca = pd.DataFrame(pca.fit_transform(df_scaled)) # fit_transform PCA on the scaled data
# The percentage of variance explained by each principal component is stored
exp_var = pca.explained_variance_ratio_
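The explained-variance ratios computed above can be accumulated to pick the smallest number of components that retains a chosen share of the variance. A self-contained sketch on synthetic data (the 90% threshold is an illustrative choice, not a rule):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
X[:, 3] = X[:, 0] + 0.05 * rng.normal(size=300)  # make two columns redundant so
X[:, 4] = X[:, 1] + 0.05 * rng.normal(size=300)  # variance concentrates in the first components

pca = PCA(n_components=6, random_state=1).fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches 90%
k = int(np.argmax(cum_var >= 0.90)) + 1
```

With all components kept, the ratios sum to 1, so such a k always exists; the same cumulative-sum logic applies directly to the exp_var array above.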
Let's plot the first two components and see how the data points are distributed.
# Scatter plot for two components using the dataframe data_pca
plt.figure(figsize=(7, 7))
sns.scatterplot(x=0, y=1, data=data_pca)
plt.show()
Let's apply clustering algorithms on the data generated after applying PCA
distortions = [] # Creating an empty list to store the distortions
K = range(2, 10) # Setting the K range from 2 to 9
for k in K:
    kmeanModel = KMeans(n_clusters=k, random_state=4) # Initialize K-Means
    kmeanModel.fit(data_pca) # Fit K-Means on the data
    distortions.append(kmeanModel.inertia_) # Append the distortion (inertia) to the list created above
# Plotting the elbow plot
plt.figure(figsize=(16, 8)) # Setting the plot size
plt.plot(K, distortions, "bx-") # Plotting the K on X-axis and distortions on y-axis
plt.xlabel("k") # Title of x-axis
plt.ylabel("Distortion") # Title of y-axis
plt.title("The Elbow Method showing the optimal k") # Title of the plot
plt.show()
In the above plot, the elbow is seen for K=3 and K=5 as there is some drop in distortion at K=3 and K=5.
Think About It:
We can use the silhouette score as a metric for different K values to make a better decision about picking the number of clusters (K).
Silhouette score is one of the methods for evaluating the quality of clusters created using clustering algorithms such as K-Means. The silhouette score is a measure of how similar an object is to its cluster (cohesion) compared to other clusters (separation). Silhouette score has a range of [-1, 1].
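For a single point, the score is s = (b - a) / max(a, b), where a is its mean distance to points in its own cluster and b its mean distance to the nearest other cluster; the overall score is the mean of the per-point values. A tiny hand-checkable sketch (the coordinates are made up):

```python
import numpy as np
from sklearn.metrics import silhouette_score, silhouette_samples

# Two tight, well-separated blobs -> silhouette scores near +1
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

per_point = silhouette_samples(X, labels)  # one s value per observation
overall = silhouette_score(X, labels)      # mean of the per-point scores
```

Moving the two blobs closer together drives the scores toward 0, and assigning points to the wrong blob drives them negative.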
Finding silhouette score for each value of K
sil_score = [] # Creating an empty list to store the silhouette scores
cluster_list = range(3, 7) # Creating a range of cluster counts from 3 to 6
for n_clusters in cluster_list:
    # Initialize K-Means with number of clusters equal to n_clusters and random_state=1
    clusterer = KMeans(n_clusters=n_clusters, random_state=1)
    # Fit and predict on the PCA data
    preds = clusterer.fit_predict(data_pca)
    # Calculate the silhouette score - Hint: Use the silhouette_score() function
    score = silhouette_score(data_pca, preds)
    # Append the silhouette score to the list created above
    sil_score.append(score)
    # Print the silhouette score
    print("For n_clusters = {}, the silhouette score is {}".format(n_clusters, score))
For n_clusters = 3, the silhouette score is 0.26912817852239623
For n_clusters = 4, the silhouette score is 0.2714184940821574
For n_clusters = 5, the silhouette score is 0.23502481034162284
For n_clusters = 6, the silhouette score is 0.2367264667444932
From the above silhouette scores, 3 appears to be a good value of K. So, let's build K-Means using K=3.
kmeans = KMeans(n_clusters = 3, random_state = 1) # Initialize the K-Means algorithm with 3 clusters and random_state=1
kmeans.fit(data_pca) # Fitting on the data_pca
KMeans(n_clusters=3, random_state=1)
data_pca["K_means_segments_3"] = kmeans.labels_ # Adding K-Means cluster labels to the data_pca data
data["K_means_segments_3"] = kmeans.labels_ # Adding K-Means cluster labels to the whole data
data_model["K_means_segments_3"] = kmeans.labels_ # Adding K-Means cluster labels to data_model
# Let's check the distribution
data_model["K_means_segments_3"].value_counts()
0    1058
1     606
2     566
Name: K_means_segments_3, dtype: int64
Let's visualize the clusters using PCA
# Function to visualize the PCA data with the clusters formed
def PCA_PLOT(x_col, y_col, pca_df, cluster):
    sns.scatterplot(x=x_col, y=y_col, data=pca_df, hue=cluster)
PCA_PLOT(0, 1, data_pca, "K_means_segments_3")
Observations and Insights:
# Taking the cluster-wise mean of all the variables. Hint: First groupby 'data' by 'K_means_segments_3' and then find mean
cluster_profile_KMeans_3 = data.groupby('K_means_segments_3').mean()
# Highlighting the maximum average value among all the clusters for each of the variables
cluster_profile_KMeans_3.style.highlight_max(color="lightgreen", axis=0)
| Year_Birth | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | Kids | Status | Family_Size | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K_means_segments_3 | ||||||||||||||||||||||||||||||||
| 0 | 1970.977316 | 35352.136106 | 0.756144 | 0.476371 | 49.297732 | 45.251418 | 5.083176 | 23.678639 | 7.349716 | 5.137996 | 15.243856 | 2.029301 | 2.142722 | 0.573724 | 3.293951 | 6.375236 | 0.068053 | 0.015123 | 0.000000 | 0.000945 | 0.001890 | 0.010397 | 0.084121 | 45.022684 | 1.232514 | 1.653119 | 2.885633 | 101.744802 | 8.039698 | 499.818526 | 0.170132 | 11.313861 |
| 1 | 1965.942244 | 57587.478548 | 0.278878 | 0.839934 | 47.528053 | 453.122112 | 23.174917 | 138.214521 | 30.630363 | 24.759076 | 62.414191 | 3.778878 | 6.498350 | 3.103960 | 7.886139 | 5.755776 | 0.070957 | 0.123762 | 0.021452 | 0.036304 | 0.014851 | 0.008251 | 0.133663 | 50.057756 | 1.118812 | 1.655116 | 2.773927 | 732.315182 | 21.267327 | 594.488449 | 0.400990 | 34.272181 |
| 2 | 1968.183746 | 76370.727915 | 0.037102 | 0.210247 | 50.450530 | 631.501767 | 69.478799 | 467.376325 | 101.742049 | 70.863958 | 78.411661 | 1.349823 | 5.183746 | 6.125442 | 8.291519 | 2.879859 | 0.084806 | 0.134276 | 0.263251 | 0.213781 | 0.031802 | 0.007067 | 0.289753 | 47.816254 | 0.247350 | 1.620141 | 1.867491 | 1419.374558 | 20.950530 | 550.365724 | 1.017668 | 73.334029 |
Observations and Insights:
Let us create a boxplot for each of the variables
# Columns to use in boxplot
col_for_box = ['Income','Kidhome','Teenhome','Recency','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts','MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases','NumWebVisitsMonth','Complain','Age','Family_Size','Expenses','NumTotalPurchases','Engaged_in_days','TotalAcceptedCmp','AmountPerPurchase']
# Creating boxplot for each of the variables
all_col = col_for_box
plt.figure(figsize = (30, 50))
sns.set(font_scale=2)
for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    sns.boxplot(y=data[variable], x=data['K_means_segments_3'], showmeans=True)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Cluster 0: belongs to the low-income group with the most children (kids and teens). Summary for cluster 0:
Cluster 1: belongs to the medium-income group with mostly teenagers at home. Summary for cluster 1:
Cluster 2: belongs to the high-income group with almost no children (kids and teens) at home. Summary for cluster 2:
Think About It:
We can see from the above profiles that K=3 segments the customers into High-, Medium-, and Low-income customers, and we are not getting deep insights into the different types of customers. So, let's build K=5 (which shows another elbow in the Elbow curve) and see if we can get better cluster profiles.
# Dropping labels we got from K=3 since we will be using PCA data for prediction
# Drop K_means_segments_3. Hint: Use axis=1 and inplace=True
data_pca.drop("K_means_segments_3", axis=1, inplace=True)
data.drop("K_means_segments_3", axis=1, inplace=True)
Let's build K-Means using K=5
# Fit the K-Means algorithm using number of cluster as 5 and random_state=0 on data_pca
kmeans = KMeans(n_clusters = 5, random_state = 0)
kmeans.fit(data_pca)
KMeans(n_clusters=5, random_state=0)
# Add K-Means cluster labels to data_pca
# Add K-Means cluster labels to whole data
# Add K-Means cluster labels to data_model
data_pca["K_means_segments_5"] = kmeans.labels_
data["K_means_segments_5"] = kmeans.labels_
data_model["K_means_segments_5"] = kmeans.labels_
# Let's check the distribution
data_pca["K_means_segments_5"].value_counts()
0    1001
2     530
1     451
4     247
3       1
Name: K_means_segments_5, dtype: int64
Let's visualize the clusters using PCA
sns.set(font_scale=1)
# Hint: Use PCA_PLOT function created above
PCA_PLOT(0, 1, data_pca, "K_means_segments_5")
# Take the cluster-wise mean of all the variables. Hint: First groupby 'data' by cluster labels column and then find mean
cluster_profile_KMeans_5 = data.groupby('K_means_segments_5').mean()
# Highlight the maximum average value among all the clusters for each of the variables
cluster_profile_KMeans_5.style.highlight_max(color="lightgreen", axis=0)
| Year_Birth | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | Kids | Status | Family_Size | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K_means_segments_5 | ||||||||||||||||||||||||||||||||
| 0 | 1971.301698 | 34762.484515 | 0.758242 | 0.458541 | 49.274725 | 38.295704 | 4.849151 | 21.219780 | 7.055944 | 5.070929 | 14.408591 | 1.900100 | 2.004995 | 0.527473 | 3.210789 | 6.342657 | 0.062937 | 0.012987 | 0.000000 | 0.000999 | 0.001998 | 0.010989 | 0.081918 | 44.698302 | 1.216783 | 1.655345 | 2.872128 | 90.900100 | 7.643357 | 493.825175 | 0.160839 | 10.818857 |
| 1 | 1967.403548 | 70815.707317 | 0.062084 | 0.370288 | 50.982262 | 457.130820 | 67.075388 | 349.798226 | 96.077605 | 69.250554 | 80.465632 | 1.705100 | 5.392461 | 5.232816 | 8.931264 | 3.033259 | 0.019956 | 0.019956 | 0.035477 | 0.050998 | 0.000000 | 0.008869 | 0.064302 | 48.596452 | 0.432373 | 1.647450 | 2.079823 | 1119.798226 | 21.261641 | 527.263858 | 0.190687 | 54.738124 |
| 2 | 1965.545283 | 54538.571698 | 0.362264 | 0.879245 | 47.894340 | 417.652830 | 15.760377 | 112.283019 | 21.528302 | 15.745283 | 55.769811 | 4.152830 | 6.290566 | 2.675472 | 7.143396 | 6.258491 | 0.086792 | 0.126415 | 0.011321 | 0.028302 | 0.011321 | 0.007547 | 0.141509 | 50.454717 | 1.241509 | 1.647170 | 2.888679 | 638.739623 | 20.262264 | 610.262264 | 0.405660 | 31.003787 |
| 3 | 1978.000000 | 51369.000000 | 0.000000 | 0.000000 | 53.000000 | 32.000000 | 2.000000 | 1607.000000 | 12.000000 | 4.000000 | 22.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 38.000000 | 0.000000 | 2.000000 | 2.000000 | 1679.000000 | 1.000000 | 754.000000 | 1.000000 | 1679.000000 |
| 4 | 1969.060729 | 80301.461538 | 0.044534 | 0.161943 | 47.611336 | 866.437247 | 61.890688 | 539.384615 | 89.506073 | 64.340081 | 73.032389 | 1.331984 | 5.530364 | 6.676113 | 7.805668 | 3.356275 | 0.182186 | 0.311741 | 0.566802 | 0.425101 | 0.085020 | 0.004049 | 0.599190 | 46.939271 | 0.206478 | 1.595142 | 1.801619 | 1694.591093 | 21.344130 | 584.076923 | 2.170040 | 83.475670 |
Let's plot the boxplot
# Create boxplot for each of the variables
all_col = col_for_box
plt.figure(figsize = (30, 50))
sns.set(font_scale=2)
for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    sns.boxplot(y=data[variable], x=data['K_means_segments_5'], showmeans=True)
    plt.tight_layout()
    plt.title(variable)
plt.show()
sns.set(font_scale=1)
Cluster 0: Summary for cluster 0:___
Cluster 1:___ Summary for cluster 1:___
Cluster 2:___ Summary for cluster 2:___
Cluster 3:___ Summary for cluster 3:___
Cluster 4:___ Summary for cluster 4:___
# Dropping labels we got from K-Means since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop("K_means_segments_5", axis=1, inplace=True)
data.drop("K_means_segments_5", axis=1, inplace=True)
From the above profiles, K=5 provides more interesting insights into the customers' purchasing behavior and preferred purchase channels. We can also see that the High-, Medium-, and Low-income groups have different age groups and preferences, which was not evident with K=3. So, we can choose K=5.
Let's find the silhouette score for K=5 in K-Medoids
kmedo = KMedoids(n_clusters = 5, random_state = 1) # Initializing K-Medoids with number of clusters as 5 and random_state=1
preds = kmedo.fit(data_pca).predict(data_pca) # Fit and predict K-Medoids using data_pca
score = silhouette_score(data_pca, preds) # Calculate the silhouette score
print(score) # Print the score
0.11225735300731468
Observations and Insights:
# Adding the K-Medoids cluster labels to the whole data
data['kmedoLabels'] = preds
# Adding the K-Medoids cluster labels to data_model
data_model['kmedoLabels'] = preds
# Adding the K-Medoids cluster labels to data_pca
data_pca['kmedoLabels'] = preds
# Let's check the distribution
data.kmedoLabels.value_counts()
0    684
3    639
2    304
1    303
4    300
Name: kmedoLabels, dtype: int64
Let's visualize the clusters using PCA
# Hint: Use PCA_PLOT function created above
PCA_PLOT(0, 1, data_pca, "kmedoLabels")
# Take the cluster-wise mean of all the variables. Hint: First group 'data' by cluster labels column and then find mean
kmedo_clustering_segment = data.groupby('kmedoLabels').mean()
# Highlight the maximum average value among all the clusters for each of the variables
kmedo_clustering_segment.style.highlight_max(color="lightgreen", axis=0)
| Year_Birth | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | Kids | Status | Family_Size | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| kmedoLabels | ||||||||||||||||||||||||||||||||
| 0 | 1972.519006 | 32669.786550 | 0.807018 | 0.403509 | 44.461988 | 21.457602 | 3.328947 | 13.428363 | 4.995614 | 3.428363 | 10.295322 | 1.535088 | 1.606725 | 0.290936 | 2.861111 | 6.494152 | 0.081871 | 0.007310 | 0.000000 | 0.000000 | 0.002924 | 0.011696 | 0.076023 | 43.480994 | 1.210526 | 1.652047 | 2.862573 | 56.934211 | 6.293860 | 470.497076 | 0.168129 | 8.523335 |
| 1 | 1969.379538 | 39736.607261 | 0.742574 | 0.673267 | 45.818482 | 178.221122 | 8.557756 | 63.514851 | 14.254125 | 8.828383 | 43.825083 | 4.504950 | 4.442244 | 1.389439 | 4.551155 | 7.729373 | 0.062706 | 0.046205 | 0.000000 | 0.003300 | 0.003300 | 0.016502 | 0.204620 | 46.620462 | 1.415842 | 1.617162 | 3.033003 | 317.201320 | 14.887789 | 755.082508 | 0.320132 | 20.427279 |
| 2 | 1966.430921 | 65041.937500 | 0.131579 | 0.644737 | 30.674342 | 532.453947 | 26.582237 | 179.536184 | 36.779605 | 31.677632 | 45.069079 | 2.569079 | 6.052632 | 3.779605 | 9.069079 | 4.601974 | 0.092105 | 0.259868 | 0.138158 | 0.128289 | 0.049342 | 0.009868 | 0.230263 | 49.569079 | 0.776316 | 1.628289 | 2.404605 | 852.098684 | 21.470395 | 474.516447 | 0.898026 | 40.334456 |
| 3 | 1967.508607 | 72751.917058 | 0.070423 | 0.348983 | 55.028169 | 618.676056 | 66.477308 | 425.378717 | 94.489828 | 66.899844 | 86.555556 | 1.857590 | 5.807512 | 5.758998 | 8.239437 | 3.521127 | 0.078247 | 0.084507 | 0.186228 | 0.156495 | 0.017214 | 0.004695 | 0.230047 | 48.491393 | 0.419405 | 1.624413 | 2.043818 | 1358.477308 | 21.663537 | 589.438185 | 0.752739 | 68.248298 |
| 4 | 1965.630000 | 49596.180000 | 0.426667 | 0.776667 | 69.103333 | 180.066667 | 11.056667 | 61.730000 | 15.426667 | 10.500000 | 30.003333 | 2.726667 | 3.810000 | 1.686667 | 5.330000 | 4.766667 | 0.033333 | 0.050000 | 0.003333 | 0.013333 | 0.000000 | 0.003333 | 0.010000 | 50.370000 | 1.203333 | 1.720000 | 2.923333 | 308.783333 | 13.553333 | 430.203333 | 0.110000 | 21.181011 |
Let's plot the boxplot
# Create boxplot for each of the variables
all_col = col_for_box
plt.figure(figsize = (30, 50))
sns.set(font_scale=2)
for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    sns.boxplot(y=data[variable], x=data['kmedoLabels'], showmeans=True)
    plt.tight_layout()
    plt.title(variable)
plt.show()
sns.set(font_scale=1)
Cluster 0:__ Summary for cluster 0:___
Cluster 1:___ Summary for cluster 1:___
Cluster 2:___ Summary for cluster 2:___
Cluster 3:___ Summary for cluster 3:___
Cluster 4:___ Summary for cluster 4:___
# Dropping labels we got from K-Medoids since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop("kmedoLabels", axis=1, inplace=True)
data.drop("kmedoLabels", axis=1, inplace=True)
Let's find the Cophenetic correlation for different distances with different linkage methods.
The cophenetic correlation coefficient is a correlation coefficient between the cophenetic distances (the dendrogram heights) obtained from the tree and the original distances used to construct the tree. It is a measure of how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points.
The cophenetic distance between two observations is represented in a dendrogram by the height of the link at which those two observations are first joined. That height is the distance between the two subclusters that are merged by that link.
Cophenetic correlation is also a way to compare two or more dendrograms.
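A tiny hand-checkable illustration of the definition (the points are chosen by hand): with points at 0, 1, and 10 on a line, single linkage merges points 0 and 1 first at height 1, so their cophenetic distance is 1, while each of them first joins point 10 only at the final merge, at height 9.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0], [1.0], [10.0]])  # three points on a line
Z = linkage(X, metric="euclidean", method="single")

c, coph_dists = cophenet(Z, pdist(X))  # correlation and condensed cophenetic distances
D = squareform(coph_dists)             # cophenetic distances as a 3x3 matrix
# D[0, 1] is the height at which points 0 and 1 first merge (1.0);
# D[0, 2] and D[1, 2] are the height of the final merge (9.0)
```

Comparing D against the original pairwise distances (1, 10, 9) is exactly what the cophenetic correlation c summarizes.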
Let's calculate Cophenetic correlation for each of the distance metrics with each of the linkage methods
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# list of linkage methods
linkage_methods = ["single", "complete", "average"]
high_cophenet_corr = 0 # Variable to track the highest cophenetic correlation seen so far
high_dm_lm = [0, 0] # List to store the corresponding distance metric and linkage method
for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(data_pca, metric=dm, method=lm) # Applying different linkages with different distances on data_pca
        c, coph_dists = cophenet(Z, pdist(data_pca)) # Calculating the cophenetic correlation
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c: # Checking if this cophenetic correlation is higher than the best so far
            high_cophenet_corr = c # Storing the new highest value
            high_dm_lm[0] = dm # Storing its distance metric
            high_dm_lm[1] = lm # Storing its linkage method
Cophenetic correlation for Euclidean distance and single linkage is 0.8144235363028587.
Cophenetic correlation for Euclidean distance and complete linkage is 0.5382569223048559.
Cophenetic correlation for Euclidean distance and average linkage is 0.851319325915757.
Cophenetic correlation for Chebyshev distance and single linkage is 0.8247390010547467.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.6146660207500972.
Cophenetic correlation for Chebyshev distance and average linkage is 0.8045096030731448.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.6573005880829239.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.4716566676805196.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.67507197359385.
Cophenetic correlation for Cityblock distance and single linkage is 0.808691020533747.
Cophenetic correlation for Cityblock distance and complete linkage is 0.8294847305072225.
Cophenetic correlation for Cityblock distance and average linkage is 0.8613796016560248.
# Printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Highest cophenetic correlation is 0.8613796016560248, which is obtained with Cityblock distance and average linkage.
Let's have a look at the dendrograms for different linkages with Cityblock distance
#from scipy.cluster.hierarchy import dendrogram, linkage
# List of linkage methods
linkage_methods = ["single", "complete", "average"]
# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30)) # Setting the plot size
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
Z = linkage(data_pca, metric="cityblock", method=method) # Measures the distances between two clusters
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)") # Title of dendrogram
coph_corr, coph_dist = cophenet(Z, pdist(data_pca)) # Finding cophenetic correlation for different linkages with city block distance
axs[i].annotate(
f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction",
)
Observations and Insights:
Think about it:
Let's have a look at the dendrograms for different linkages with Chebyshev distance
# Plot the dendrogram for Chebyshev distance with linkages single, complete and average.
# Hint: Use Chebyshev distance as the metric in the linkage() function
# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30)) # Setting the plot size
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
Z = linkage(data_pca, metric="chebyshev", method=method) # Measures the distances between two clusters
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)") # Title of dendrogram
coph_corr, coph_dist = cophenet(Z, pdist(data_pca)) # Finding cophenetic correlation for different linkages with Chebyshev distance
axs[i].annotate(
f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction",
)
Observations and Insights:
Let's have a look at the dendrograms for different linkages with Mahalanobis distance
# Plot the dendrogram for Mahalanobis distance with linkages single, complete and average.
# Hint: Use Mahalanobis distance as the metric in the linkage() function
# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30)) # Setting the plot size
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
Z = linkage(data_pca, metric="mahalanobis", method=method) # Measures the distances between two clusters
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)") # Title of dendrogram
coph_corr, coph_dist = cophenet(Z, pdist(data_pca)) # Finding cophenetic correlation for different linkages with Mahalanobis distance
axs[i].annotate(
f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction",
)
Observations and Insights:
Let's have a look at the dendrograms for different linkages with Euclidean distance
# Plot the dendrogram for Euclidean distance with linkages single, complete, average and ward.
# Hint: Use Euclidean distance as the metric in the linkage() function
# To create a subplot image
# Ward linkage is defined for Euclidean distance, so we include it alongside the other linkage methods
euclidean_linkage_methods = linkage_methods + ["ward"]
fig, axs = plt.subplots(len(euclidean_linkage_methods), 1, figsize=(15, 40)) # Setting the plot size
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(euclidean_linkage_methods):
Z = linkage(data_pca, metric="euclidean", method=method) # Measures the distances between two clusters
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)") # Title of dendrogram
coph_corr, coph_dist = cophenet(Z, pdist(data_pca)) # Finding cophenetic correlation for different linkages with Euclidean distance
axs[i].annotate(
f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction",
)
Think about it:
Observations and Insights:
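As an aside, a linkage matrix like the ones built above can also be cut directly into a fixed number of flat clusters with scipy's fcluster, without going through sklearn. A minimal, self-contained sketch on synthetic data (the data and variable names here are illustrative, not the notebook's):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three well-separated synthetic groups as a stand-in for the PCA scores
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0, 0.1, (10, 2)),
    rng.normal(5, 0.1, (10, 2)),
    rng.normal(10, 0.1, (10, 2)),
])

# Build the tree with the best-scoring combination found above
Z = linkage(X, metric="cityblock", method="average")

# Cut the tree into a fixed number of flat clusters (labels are 1..t)
labels = fcluster(Z, t=3, criterion="maxclust")
print(len(set(labels)))  # 3
```

The same idea works with criterion="distance" to cut at a chosen dendrogram height instead of a fixed cluster count.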
# Initialize Agglomerative Clustering with affinity (distance) as Euclidean, linkage as 'complete', and n_clusters=3
HCmodel = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='complete')
# Fit on data_pca
HCmodel.fit(data_pca)
AgglomerativeClustering(linkage='complete', n_clusters=3)
# Add Agglomerative Clustering cluster labels to data_pca
data_pca["HCLabels"] = HCmodel.labels_
# Add Agglomerative Clustering cluster labels to the whole data
data["HCLabels"] = HCmodel.labels_
# Add Agglomerative Clustering cluster labels to data_model
data_model["HCLabels"] = HCmodel.labels_
# Let's check the distribution
data.HCLabels.value_counts()
0    2225
2       4
1       1
Name: HCLabels, dtype: int64
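A split this skewed (2225 / 4 / 1) is typical of complete (or single) linkage on data containing a few extreme points: the outliers get peeled off into tiny clusters of their own. A small synthetic sketch of the effect (illustrative data, not the campaign dataset):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# A dense blob plus two extreme outliers
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), [[50, 50]], [[-60, 40]]])

# Complete linkage isolates each outlier into its own singleton cluster
cl = AgglomerativeClustering(n_clusters=3, linkage="complete").fit(X)
counts = np.bincount(cl.labels_)
print(sorted(counts))  # two singleton clusters and one large cluster
```

Ward linkage (which minimizes within-cluster variance) tends to produce more balanced cluster sizes on such data, at the cost of absorbing the outliers into larger groups.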
Let's visualize the clusters using PCA.
# Hint: Use PCA_PLOT function created above
PCA_PLOT(0, 1, data_pca, "HCLabels")
# Take the cluster-wise mean of all the variables. Hint: First group 'data' by cluster labels column and then find mean
hc_clustering_segment = data.groupby('HCLabels').mean()
# Highlight the maximum average value among all the clusters for each of the variables
hc_clustering_segment.style.highlight_max(color="lightgreen", axis=0)
| Year_Birth | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | Kids | Status | Family_Size | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HCLabels | ||||||||||||||||||||||||||||||||
| 0 | 1968.880000 | 51684.222921 | 0.444494 | 0.508315 | 49.117753 | 305.523146 | 26.393708 | 164.082697 | 37.703820 | 27.208090 | 44.181124 | 2.317303 | 4.107416 | 2.628764 | 5.822472 | 5.330787 | 0.073258 | 0.074607 | 0.072809 | 0.064719 | 0.013034 | 0.008989 | 0.150112 | 47.120000 | 0.952809 | 1.644494 | 2.597303 | 605.092584 | 14.875955 | 538.098427 | 0.448539 | 32.531491 |
| 1 | 1978.000000 | 51369.000000 | 0.000000 | 0.000000 | 53.000000 | 32.000000 | 2.000000 | 1607.000000 | 12.000000 | 4.000000 | 22.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 38.000000 | 0.000000 | 2.000000 | 2.000000 | 1679.000000 | 1.000000 | 754.000000 | 1.000000 | 1679.000000 |
| 2 | 1977.750000 | 119409.750000 | 0.250000 | 0.250000 | 43.500000 | 19.250000 | 4.750000 | 1663.500000 | 5.250000 | 1.750000 | 1.750000 | 11.250000 | 0.000000 | 26.500000 | 0.250000 | 0.500000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 38.250000 | 0.500000 | 2.000000 | 2.500000 | 1696.250000 | 38.000000 | 638.000000 | 0.000000 | 46.125115 |
Let's plot the boxplot
# Create boxplot for each of the variables
all_col = col_for_box
plt.figure(figsize = (30, 50))
sns.set(font_scale=2)
for i, variable in enumerate(all_col):
plt.subplot(6, 4, i + 1)
sns.boxplot(y=data[variable], x=data['HCLabels'],showmeans=True)
plt.tight_layout()
plt.title(variable)
plt.show()
sns.set(font_scale=1)
Cluster 0: ___ Summary for cluster 0: ___
Cluster 1: ___ Summary for cluster 1: ___
Cluster 2: ___ Summary for cluster 2: ___
Observations and Insights:
# Dropping labels we got from Agglomerative Clustering since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop("HCLabels", axis=1, inplace=True)
data.drop("HCLabels", axis=1, inplace=True)
DBSCAN is a powerful algorithm for finding high-density clusters, but choosing good hyperparameter values for it is difficult. It has two key hyperparameters: eps and min_samples.
Since DBSCAN is an unsupervised algorithm, there are no ground-truth labels to validate against, unlike in supervised learning, where a model can be evaluated on a validation set. A practical approach is to try a number of hyperparameter combinations and compute the silhouette score for each.
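Besides a silhouette-score grid search, a common heuristic for choosing eps is the k-distance plot: sort every point's distance to its k-th nearest neighbour (for k = min_samples) and look for the "elbow" of the resulting curve. A minimal sketch on synthetic data (illustrative, not the notebook's data_pca):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the PCA-reduced data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# Distance from each point to its k-th nearest neighbour
# (the first neighbour returned for each point is the point itself, at distance 0)
k = 6
nn = NearestNeighbors(n_neighbors=k).fit(X)
dists, _ = nn.kneighbors(X)
k_dists = np.sort(dists[:, -1])  # ascending curve; its elbow suggests eps
print(k_dists.shape)  # (200,)
```

In a notebook one would typically plot k_dists with matplotlib and read eps off the point where the curve bends sharply upward.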
# Initializing lists
eps_value = [2, 3] # Candidate eps values
min_sample_values = [6, 20] # Candidate min_samples values
# Creating a dictionary for each of the values in eps_value with min_sample_values
res = {eps_value[i]: min_sample_values for i in range(len(eps_value))}
from sklearn.cluster import DBSCAN
# Finding the silhouette_score for each of the combinations
high_silhouette_avg = 0 # Assigning 0 to the high_silhouette_avg variable
high_i_j = [0, 0] # Assigning 0's to the high_i_j list
key = res.keys() # Assigning dictionary keys to a variable called key
for i in key:
z = res[i] # Assigning dictionary values of each i to z
for j in z:
db = DBSCAN(eps=i, min_samples=j).fit(data_pca) # Applying DBSCAN to each of the combination in dictionary
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
silhouette_avg = silhouette_score(data_pca, labels) # Finding silhouette score
print(
"For eps value =" + str(i),
"For min sample =" + str(j),
"The average silhouette_score is :",
silhouette_avg, # Printing the silhouette score for each of the combinations
)
if high_silhouette_avg < silhouette_avg: # If this silhouette score beats the best so far, store it along with its combination of i and j
high_silhouette_avg = silhouette_avg # Storing the new highest silhouette score
high_i_j[0] = i
high_i_j[1] = j
For eps value =2 For min sample =6 The average silhouette_score is : 0.19774043234901528
For eps value =2 For min sample =20 The average silhouette_score is : 0.33772677532880413
For eps value =3 For min sample =6 The average silhouette_score is : 0.3441301295900913
For eps value =3 For min sample =20 The average silhouette_score is : 0.3422032926810415
# Printing the highest silhouette score
print("Highest silhouette_avg is {} for eps = {} and min sample = {}".format(high_silhouette_avg, high_i_j[0], high_i_j[1]))
Highest silhouette_avg is 0.3441301295900913 for eps = 3 and min sample = 6
Now, let's apply DBSCAN using the hyperparameter values we have received above.
# Apply DBSCAN using the eps value selected above
dbs = DBSCAN(eps = 3) # Note: min_samples is left at the sklearn default of 5 here
# fit_predict on data_pca once and add the DBSCAN cluster labels to the whole data, data_model, and data_pca
db_labels = dbs.fit_predict(data_pca)
data['DBSLabels'] = db_labels
data_model['DBSLabels'] = db_labels
data_pca['DBSLabels'] = db_labels
# Let's check the distribution
data.DBSLabels.value_counts()
 0    2072
-1     158
Name: DBSLabels, dtype: int64
Let's visualize the clusters using PCA.
# Hint: Use PCA_PLOT function created above
PCA_PLOT(0, 1, data_pca, "DBSLabels")
Observations and Insights:
Think about it:
Note: You can experiment with different eps and min_sample values to see if DBSCAN produces good distribution and cluster profiles.
# Dropping labels we got from DBSCAN since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop("DBSLabels", axis=1, inplace=True)
data.drop("DBSLabels", axis=1, inplace=True)
Let's find the silhouette score for K=5 in Gaussian Mixture
gmm = GaussianMixture(n_components = 5, random_state = 1) # Initialize Gaussian Mixture Model with number of clusters as 5 and random_state=1
preds = gmm.fit(data_pca).predict(data_pca) # Fit and predict Gaussian Mixture Model using data_pca
score = silhouette_score(data_pca, preds) # Calculate the silhouette score
print(score) # Print the score
0.14627588259256083
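The choice of K=5 components is not the only option; for Gaussian mixtures, an information criterion such as BIC is a standard way to compare candidate component counts (lower is better). A small sketch on synthetic two-cluster data (illustrative, not data_pca):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two well-separated synthetic components
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-5, 1, (150, 2)), rng.normal(5, 1, (150, 2))])

# Fit a GMM for each candidate component count and keep the lowest BIC
bics = {}
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, random_state=1).fit(X)
    bics[k] = gm.bic(X)
best_k = min(bics, key=bics.get)
print(best_k)  # 2
```

The same loop applied to the notebook's data_pca would give a data-driven alternative to fixing the number of components in advance.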
Observations and Insights:
# Predicting on data_pca and add Gaussian Mixture Model cluster labels to the whole data
data["GMMLabels"] = preds
# Predicting on data_pca and add Gaussian Mixture Model cluster labels to data_model
data_model["GMMLabels"] = preds
# Predicting on data_pca and add Gaussian Mixture Model cluster labels to data_pca
data_pca["GMMLabels"] = preds
# Let's check the distribution
data.GMMLabels.value_counts()
1    759
4    606
3    449
0    340
2     76
Name: GMMLabels, dtype: int64
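A useful property of the Gaussian Mixture Model, compared to hard-assignment methods like K-Means, is that it produces soft cluster memberships via predict_proba. A minimal sketch on synthetic data (illustrative, not the notebook's variables):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic components as a stand-in for the PCA scores
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-4, 1, (100, 2)), rng.normal(4, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, random_state=1).fit(X)

# Each row gives the probability of the point belonging to each component
probs = gmm.predict_proba(X)
print(probs.shape)  # (200, 2)
print(bool(np.allclose(probs.sum(axis=1), 1.0)))  # True: rows sum to 1
```

For segmentation, these probabilities can flag "borderline" customers who sit between two segments and might warrant blended marketing treatment.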
Let's visualize the clusters using PCA.
# Hint: Use PCA_PLOT function created above
PCA_PLOT(0, 1, data_pca, "GMMLabels")
# Take the cluster-wise mean of all the variables. Hint: First group 'data' by cluster labels column and then find mean
gmm_clustering_segment = data.groupby('GMMLabels').mean()
# Highlight the maximum average value among all the clusters for each of the variables
gmm_clustering_segment.style.highlight_max(color="lightgreen", axis=0)
| Year_Birth | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | Kids | Status | Family_Size | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GMMLabels | ||||||||||||||||||||||||||||||||
| 0 | 1966.920588 | 63390.779412 | 0.170588 | 0.726471 | 45.932353 | 570.997059 | 45.547059 | 232.017647 | 60.832353 | 50.879412 | 78.338235 | 3.529412 | 6.732353 | 5.052941 | 9.111765 | 5.591176 | 0.102941 | 0.123529 | 0.052941 | 0.076471 | 0.032353 | 0.011765 | 0.194118 | 49.079412 | 0.897059 | 1.608824 | 2.505882 | 1038.611765 | 24.426471 | 642.961765 | 0.582353 | 42.887650 |
| 1 | 1971.996047 | 32166.699605 | 0.852437 | 0.444005 | 48.245059 | 20.276680 | 3.067194 | 12.928854 | 4.731225 | 3.280632 | 10.223979 | 1.772069 | 1.633729 | 0.316206 | 2.889328 | 6.397892 | 0.071146 | 0.003953 | 0.000000 | 0.000000 | 0.002635 | 0.013175 | 0.076416 | 44.003953 | 1.296443 | 1.646904 | 2.943347 | 54.508564 | 6.611331 | 484.523057 | 0.154150 | 7.977926 |
| 2 | 1968.486842 | 72561.671053 | 0.144737 | 0.276316 | 48.500000 | 805.065789 | 60.500000 | 446.223684 | 78.486842 | 58.157895 | 82.368421 | 1.276316 | 3.815789 | 4.434211 | 5.486842 | 4.881579 | 0.131579 | 0.355263 | 0.526316 | 0.355263 | 0.118421 | 0.000000 | 0.526316 | 47.513158 | 0.421053 | 1.592105 | 2.013158 | 1530.802632 | 15.013158 | 607.592105 | 2.013158 | 121.862778 |
| 3 | 1967.632517 | 75883.634744 | 0.020045 | 0.173719 | 50.864143 | 596.213808 | 65.120267 | 458.532294 | 97.810690 | 65.530067 | 74.955457 | 1.093541 | 5.102450 | 5.755011 | 8.285078 | 2.396437 | 0.073497 | 0.102450 | 0.229399 | 0.182628 | 0.011136 | 0.006682 | 0.224944 | 48.367483 | 0.193764 | 1.645880 | 1.839644 | 1358.162584 | 20.236080 | 513.648107 | 0.824053 | 67.764551 |
| 4 | 1967.123762 | 49459.694719 | 0.437294 | 0.740924 | 50.750825 | 233.471947 | 11.709571 | 74.013201 | 16.118812 | 11.414191 | 39.640264 | 3.412541 | 4.998350 | 1.775578 | 5.823432 | 6.037954 | 0.051155 | 0.080858 | 0.001650 | 0.014851 | 0.003300 | 0.004950 | 0.113861 | 48.876238 | 1.178218 | 1.669967 | 2.848185 | 386.367987 | 16.009901 | 556.782178 | 0.265677 | 22.972221 |
Let's plot the boxplot
# Create boxplot for each of the variables
all_col = col_for_box
plt.figure(figsize = (30, 50))
sns.set(font_scale=2)
for i, variable in enumerate(all_col):
plt.subplot(6, 4, i + 1)
sns.boxplot(y=data[variable], x=data['GMMLabels'],showmeans=True)
plt.tight_layout()
plt.title(variable)
plt.show()
sns.set(font_scale=1)
Cluster 0: Summary for cluster 0:
Cluster 1: Summary for cluster 1:
Cluster 2: Summary for cluster 2:
Cluster 3: Summary for cluster 3:
Cluster 4: Summary for cluster 4:
! jupyter nbconvert --to html notebook.ipynb